Overview of gene expression at the bin level

Which bins are most highly expressed in the two conditions? We will integrate this information with the taxonomic assignment. Complete taxonomic assignments are available here.

The following heatmap shows expression levels (in TPMs) grouped by bin. Hover over a cell of the heatmap to get more information.
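As a rough illustration of what "grouped by bin" can mean, the sketch below sums per-gene TPMs into one value per bin and sample. It is not the code behind the heatmap: the records, the sample names and the helper `tpm_by_bin` are hypothetical, and summing (rather than, say, averaging) is an assumption.

```python
from collections import defaultdict

# Hypothetical per-gene records: (gene_id, bin_id, {sample: TPM}).
gene_tpm = [
    ("gene_1", "bin_A", {"ctrl": 10.0, "treat": 2.5}),
    ("gene_2", "bin_A", {"ctrl": 5.0, "treat": 7.5}),
    ("gene_3", "bin_B", {"ctrl": 0.0, "treat": 40.0}),
]


def tpm_by_bin(records):
    """Sum TPMs over all genes of each bin, separately for each sample."""
    totals = defaultdict(lambda: defaultdict(float))
    for _gene, bin_id, tpms in records:
        for sample, tpm in tpms.items():
            totals[bin_id][sample] += tpm
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {bin_id: dict(per_sample) for bin_id, per_sample in totals.items()}


bin_tpm = tpm_by_bin(gene_tpm)
for bin_id, per_sample in sorted(bin_tpm.items()):
    print(bin_id, per_sample)
```

Each bin/sample pair in `bin_tpm` corresponds to one cell of a bin-level heatmap.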

Next, we try to integrate bin abundance in the metagenomic samples.

With ggplot2 geom_tile

Try with pheatmap

Methods

We run DESeq2 on the Salmon output.

Exploratory analysis and visualization

We prefilter the dataset (for visualization purposes only). How many genes are in the raw count data?

## [1] 116479

We keep only rows (= genes) with more than 1 count in total across all samples. How many genes remain?

## [1] 51991
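The prefiltering step can be sketched as follows. This is a minimal illustration rather than the report's own code: it assumes the filter is applied to the sum of each gene's counts over all samples, and the count table and the helper `prefilter` are hypothetical.

```python
# Hypothetical raw counts: gene -> counts in each of four samples.
counts = {
    "gene_1": [0, 0, 0, 0],
    "gene_2": [1, 0, 0, 0],
    "gene_3": [1, 1, 0, 0],
    "gene_4": [12, 30, 7, 19],
}


def prefilter(count_table, min_total=1):
    """Keep genes whose counts, summed over all samples, exceed min_total."""
    return {
        gene: row for gene, row in count_table.items() if sum(row) > min_total
    }


kept = prefilter(counts)
print(len(counts), "genes before filtering,", len(kept), "after")
```

Genes with zero counts or a single count overall carry almost no information for the plots, so removing them mainly reduces the size of the table.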

Pick a method to visualise sample relationships

Source: https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#exploratory-analysis-and-visualization

We need to choose between transformation methods for running exploratory analyses.

Scatterplot of transformed counts from two samples.

Which samples are similar to each other, and which are different? Does this fit the expectation from the experiment’s design? We use the VST-transformed results to:

  • draw a heatmap of sample-sample distances
  • run a PCA analysis.

Heatmap of sample-to-sample distances using the variance stabilizing transformed values.
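A sample-to-sample distance matrix of this kind can be sketched as below. The Euclidean distance is an assumption (the report does not name the metric), and the transformed values, sample names and the helper `sample_distances` are hypothetical.

```python
import math

# Hypothetical transformed expression values: sample -> one value per gene.
transformed = {
    "ctrl_1": [5.0, 8.0, 3.0],
    "ctrl_2": [5.0, 8.0, 4.0],
    "treat_1": [9.0, 5.0, 3.0],
}


def sample_distances(values):
    """Return the Euclidean distance for every ordered pair of samples."""
    names = sorted(values)
    return {
        (a, b): math.dist(values[a], values[b]) for a in names for b in names
    }


dist = sample_distances(transformed)
for (a, b), d in sorted(dist.items()):
    print(f"{a} vs {b}: {d:.2f}")
```

Small distances indicate samples with similar expression profiles; a hierarchical clustering of this matrix gives the dendrogram drawn next to the heatmap.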

PCA analysis

Heatmap showing how much each gene deviates in a specific sample from the gene’s average across all samples. Please note: this is NOT a list of differentially expressed genes! The 20 genes with the highest variance across samples are shown. The complete table is available here in interactive HTML format and here in CSV format.

Details about the heatmap construction… In the sample distance heatmap made previously, the dendrogram at the side shows us a hierarchical clustering of the samples. Such a clustering can also be performed for the genes. Since the clustering is only relevant for genes that actually carry a signal, one usually would only cluster a subset of the most highly variable genes. Here, for demonstration, let us select the 20 genes with the highest variance across samples. We will work with the VST data.

The heatmap becomes more interesting if we do not look at absolute expression strength but rather at the amount by which each gene deviates in a specific sample from the gene’s average across all samples. Hence, we center each gene’s values across samples and plot a heatmap (figure below).
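The two steps described above, selecting the most variable genes and centering each gene on its own mean, can be sketched as follows. The values and the helper `top_variable_centered` are hypothetical, and the toy example keeps the top 2 genes instead of 20.

```python
from statistics import mean, variance

# Hypothetical VST values: gene -> one value per sample.
vst = {
    "gene_1": [5.0, 5.0, 5.0, 5.0],
    "gene_2": [2.0, 4.0, 6.0, 8.0],
    "gene_3": [7.0, 7.0, 9.0, 9.0],
}


def top_variable_centered(table, n_top=20):
    """Pick the n_top genes with the highest sample variance, then center them."""
    ranked = sorted(table, key=lambda gene: variance(table[gene]), reverse=True)
    centered = {}
    for gene in ranked[:n_top]:
        gene_mean = mean(table[gene])
        # Subtract the gene's mean so each value is a deviation from it.
        centered[gene] = [value - gene_mean for value in table[gene]]
    return centered


heat = top_variable_centered(vst, n_top=2)
for gene, row in heat.items():
    print(gene, row)
```

After centering, every row sums to zero, so the colour scale shows deviation from the gene's average rather than absolute expression.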

Differential gene expression analysis

Aims:

  • Perform DGE
  • Group by bin
  • Group by taxonomy
  • Group by functional annotation

Summary of DGE analysis (I)

  • 116479 genes
  • 82452 genes with non-zero counts (from summary(res))
  • 21171 genes are up-regulated (LFC > 0)
  • 9507 genes are down-regulated (LFC < 0)
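Counts like the ones in this summary can be tallied as sketched below. The sketch assumes that genes are counted as up- or down-regulated only when their adjusted p-value is below a cutoff. The cutoff `alpha=0.1` is an assumption, since the report does not state the threshold, and the result rows and the helper `tally` are hypothetical.

```python
# Hypothetical result rows: (gene, log2 fold change, adjusted p-value or None).
results = [
    ("gene_1", 2.1, 0.001),
    ("gene_2", -1.4, 0.03),
    ("gene_3", 0.3, 0.60),
    ("gene_4", 1.0, None),  # no adjusted p-value available for this gene
    ("gene_5", -0.8, 0.09),
]


def tally(rows, alpha=0.1):
    """Count significant genes with positive and with negative log2 fold change."""
    significant = [
        lfc for _gene, lfc, padj in rows if padj is not None and padj < alpha
    ]
    return {
        "up": sum(1 for lfc in significant if lfc > 0),
        "down": sum(1 for lfc in significant if lfc < 0),
    }


summary = tally(results)
print(summary)
```

Genes without an adjusted p-value, or with one above the cutoff, fall into neither group, which is why the up and down counts do not add up to the number of genes with non-zero counts.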

An interactive, detailed report of the DGE analysis is available here.

Supporting files

These files have already been linked throughout this report, but are collected here for convenience.